Introduction to Git
Introduction
This workshop covers the basics of using Git to track and record changes to files on your local computer. This allows you to compare versions over time, recall earlier changes you made, and effectively collaborate on writing code and documents.
This is a hybrid workshop. First, independently work through this online tutorial at your own pace and ensure that you have successfully installed Git on your local computer. Next, join us for the live Interactive Session (details below), during which we will use the Git Version Control System to learn and practice how to manage files on our local computer and work with Git. If you need help troubleshooting your installation, drop in to DataLab’s virtual office hours prior to the Interactive Session to ensure that you will be able to follow along on your own machine.
The full workshop description can be found here.
Interactive Session Information
The workshop includes a live, interactive session to be held via Zoom on Friday, December 11, 2020 from 12:00 pm to 2:00 pm. Zoom login information will be sent to all registered participants via email; reach out to datalab-training@ucdavis.edu the day prior if you received a registration confirmation but have not received your Zoom link.
About this Tutorial
This online tutorial provides background information that will help participants to better understand the concepts introduced during the Interactive Session. It also includes information to help you successfully install Git on your local computer, which must be completed prior to the Interactive Session.
Objectives for this Workshop
1. Describe the history of Version Control Systems (VCS), including their value and function;
2. Explain how VCS manage files on your computer;
3. Successfully install and run the Git VCS on your local computer;
4. Use basic commands to interact with your computer using the Command Line;
5. Use the basic Command Line commands needed to work with Git;
6. Successfully create local repositories;
7. Place files under version control;
8. Compare multiple versions of the same file;
9. Roll back to earlier versions of a file;
10. Perform basic branching and merging;
11. Begin utilizing basic Git Workflows;
12. Identify where to go to learn more.
What is Version Control?
Version control describes a process of storing and organizing multiple versions (or copies) of documents that you create. Approaches to version control range from simple to complex and can involve the use of various human workflows and/or software applications to accomplish the overall goal of storing and managing multiple versions of the same document(s).
Most people have a folder/directory somewhere on their computer that looks something like this:
Or perhaps, this:
This is a rudimentary form of version control that relies completely on the human workflow of saving multiple versions of a file. This system works minimally well, in that it does provide you with a history of file versions theoretically organized by their time sequence. But this filesystem method provides no information about how the file has changed from version to version, why you might have saved a particular version, or specifically how the various versions are related. This human-managed filesystem approach is more subject to error than software-assisted version control systems. It is not uncommon for users to make mistakes when naming file versions, or to go back and edit files out of sequence. Software-assisted version control systems (VCS) such as Git were designed to solve this problem.
Software Assisted Version Control
Version control software has its roots in the software development community, where it is common for many coders to work on the same file, sometimes synchronously, amplifying the need to track and understand revisions. But nearly all types of computer files, not just code, can be tracked using modern version control systems. IBM’s OS/360 IEBUPDTE software update tool is widely regarded as the earliest and most widely adopted precursor to modern version control systems. The Source Code Control System (SCCS), developed at Bell Labs beginning in 1972, was the first fully fledged system designed specifically for software version control.
Today’s marketplace offers many options when it comes to choosing a version control software system. They include systems such as Git, Visual Source Safe, Subversion, Mercurial, CVS, and Plastic SCM, to name a few. Each of these systems offers its own twist on version control, differing sometimes in the area of user functionality, sometimes in how it handles things on the back-end, and sometimes both. This tutorial focuses on the Git VCS, but in the sections that follow we offer some general information about classes of version control systems to help you better understand how Git does what it does and help you make more informed decisions about how to deploy it for your own work.
Local vs Server Based Version Control
There are two general types of version control systems: Local and Server (sometimes called Cloud) based systems. When working with a Local version control system, all files, metadata, and everything associated with the version control system live on your local drive in a universe unto itself. Working locally is a perfectly reasonable option for those who work independently (not as part of a team), have no need to regularly share their files or file versions, and who have robust back-up practices for their local storage drive(s). Working locally is also sometimes the only option for projects involving protected data and/or proprietary code that cannot be shared.
Server based VCS utilize software running on your local computer that communicates with a remote server (or servers) that store your files and data. Depending on the system being deployed, files and data may reside exclusively on the server and are downloaded to temporary local storage only when a file is being actively edited. Or, the system may maintain continuous local and remote versions of your files. Server based systems facilitate team science because they allow multiple users to have access to the same files, and all their respective versions, via the server. They can also provide an important, non-local back-up of your files, protecting you from loss of data should your local storage fail.
Git is a free Server based version control system that can store files both locally and on a remote server. While the sections that follow offer a broader description of Server based version control, in this workshop we will focus only on using Git locally and will not configure the software to communicate with, store files on, or otherwise interact with a remote server. DataLab’s companion “Git for Teams” workshop focuses on using Git with the GitHub cloud service to capitalize on Git’s distributed version control capabilities.
Server based version control systems can generally be segmented into two distinct categories: 1) Centralized Version Control Systems (Centralized VCS) and 2) Distributed Version Control Systems (Distributed VCS).
Centralized Version Control Systems
Centralized VCS is the oldest and, surprisingly to many, still the dominant form of version control architecture worldwide. Centralized VCS implement a “spoke and wheel” architecture to provide server based version control.
With the spoke and wheel architecture, the server maintains a centralized collection of file versions. Users utilize version control clients to “check-out” a file of interest to their local file storage, where they are free to make changes to the file. Centralized VCS typically restrict other users from checking out editable versions of a file if another user currently has the file checked out. Once the user who has checked out the file has finished making changes, they “check-in” their new version, which is then stored on the server from where it can be retrieved and “checked-out” by another user. As can be seen, Centralized VCS provide a very controlled and ordered universe that ensures file integrity and tracking of changes. However, this regulation comes at a cost. Namely, it reduces the ease with which multiple users can work simultaneously on the same file.
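The check-out/check-in cycle described above can be pictured with a small Python sketch. This is a toy model written for this tutorial, not the behavior of any particular Centralized VCS product; the `CentralServer` class, its method names, and the user names are all hypothetical.

```python
class CentralServer:
    """Toy model of a centralized server that locks files on check-out."""

    def __init__(self):
        self.versions = {}  # filename -> list of saved contents, oldest first
        self.locks = {}     # filename -> user who currently has it checked out

    def add(self, filename, content):
        # Register a new file with its first version.
        self.versions[filename] = [content]

    def check_out(self, user, filename):
        # Refuse the check-out if a different user already holds the file.
        holder = self.locks.get(filename)
        if holder is not None and holder != user:
            raise RuntimeError(f"{filename} is checked out by {holder}")
        self.locks[filename] = user
        return self.versions[filename][-1]

    def check_in(self, user, filename, content):
        # Only the user holding the lock may save a new version.
        if self.locks.get(filename) != user:
            raise RuntimeError(f"{user} has not checked out {filename}")
        self.versions[filename].append(content)
        del self.locks[filename]


server = CentralServer()
server.add("analysis.r", "v1")

server.check_out("ana", "analysis.r")
try:
    server.check_out("ben", "analysis.r")
except RuntimeError as err:
    print(err)  # analysis.r is checked out by ana

server.check_in("ana", "analysis.r", "v2")
print(server.check_out("ben", "analysis.r"))  # v2
```

The lock is what keeps the history orderly, and it is also what prevents two people from editing the same file at the same time.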
Distributed Version Control Systems
Distributed VCS are not dependent on a central repository as a means of sharing files or tracking versions. Distributed VCS implement a network architecture (as opposed to the spoke and wheel of the Centralized VCS as pictured above) to allow each user to communicate directly with every other user.
In Distributed VCS, each user maintains their own version history of the files being tracked, and the VCS software communicates between users to keep the various local file systems in sync with each other. With this type of system, the local versions of two different users will diverge from each other if both users make changes to the file. This divergence will remain in place until the local repositories are synced, at which time the VCS stitches (or merges) the two different versions of the file into a single version that reflects the changes made by each individual, and then saves the stitched version of the file onto both systems as the current version. Various mechanisms can then be used to resolve the conflicts that may arise during this merge process. Distributed VCS offer greater flexibility and facilitate collaborative work, but a lack of understanding of the sync/merge workflow can cause problems. It is not uncommon for a user to forget to sync their local repository with the repositories of other team members and, as a result, work for extended periods of time on outdated files that don’t reflect their teammates’ changes, which leads to work inefficiencies and merge challenges.
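To make the idea of stitching two diverged versions together more concrete, here is a minimal Python sketch of a line-by-line three-way merge. It is a deliberate simplification and not how Git itself merges: it assumes all three versions have the same number of lines, and the function name `merge_lines` and the sample text are hypothetical.

```python
def merge_lines(base, mine, theirs):
    """Toy three-way merge for versions that have the same number of lines.

    base   -- the last version both users had in common
    mine   -- the first user's edited version
    theirs -- the second user's edited version
    Returns the merged lines and the line numbers that conflict.
    """
    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:          # both users agree (or neither changed the line)
            merged.append(m)
        elif m == b:        # only the second user changed this line
            merged.append(t)
        elif t == b:        # only the first user changed this line
            merged.append(m)
        else:               # both changed it differently: a conflict
            merged.append(f"<<< {m} ||| {t} >>>")
            conflicts.append(number)
    return merged, conflicts


base = ["Title", "Intro", "Methods"]
mine = ["Title", "Introduction", "Methods"]
theirs = ["Title", "Intro", "Methods and Data"]

merged, conflicts = merge_lines(base, mine, theirs)
print(merged)     # ['Title', 'Introduction', 'Methods and Data']
print(conflicts)  # []
```

When both users change the same line in different ways, the sketch records a conflict instead of guessing, which is the situation a person then has to resolve by hand.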
The Best of Both Worlds
An important feature of Distributed VCS is that many users and organizations choose to include a central server as a node in the distributed network. This creates a hybrid universe in which some users will sync directly to each other while other users will sync through a central server.
Syncing with a cloud-based server provides an extra level of backup for your files and also facilitates communication between users. But treating the server as just another node on the network (as opposed to a centralized point of control) puts the control and flexibility back in the hands of the individual developer. For example, in a true Centralized VCS, if the server goes down then nobody can check files in and out of the server, which means that nobody can work. But in a Distributed VCS this is not an issue. Users can continue to work on local versions and the system will sync any changes when the server becomes available. Git, which is the focus of this tutorial, is a Distributed VCS. You can use Git to share and sync repositories directly with other users or through a central Git server such as, for example, GitHub or GitLab.
VCS and the Computer File System
When we think about Version Control, we typically think about managing changes to individual files. From the user perspective, the File is typically the minimum accessible unit of information. Whether working with images, tabular data, or written text, we typically use software to open a File that contains the information we want to view or edit. As such, it comes as a surprise to most users that the concept of Files, and their organizing containers (Folders or Directories), are not intrinsic to how computers themselves store and interact with data. In this section of the tutorial we will learn about how computers store and access information and how VCS interact with this process to track and manage files.
How Computers Store and Access Information
For all of their computing power and seeming intelligence, computers still only know two things: 0 and 1. In computer speak, we call this a binary system, and the unit of memory on a hard-disk, flash drive, or computer chip that stores each 1 or 0 is called a bit. You can think of your computer’s storage device (regardless of what kind it is) as a large grid, where each box is a bit:
In the above example, as with most computer storage, the bits in our storage grid are addressable, meaning that we can designate a particular bit using a row and column number such as, for example, A7 or E12. Also remember that each bit can only contain one of two values: 0 or 1. So, in practice, our storage grid would actually look something like this:
All of the complex information that we store in the computer is translated to this binary language prior to storage. For text, this translation uses a character encoding standard such as Unicode. You can think of an encoding as a codebook that assigns a unique combination of ones and zeros (commonly 8, 16, or 32 bits, depending on the encoding and the character) to each letter, numeral, or symbol. For example, in UTF-8, a common Unicode encoding, the upper case letter “A” is stored as the 8 bits “01000001”, and the digit “3” is stored as “00110011”. The above grid actually spells out the phrase, “Call me Ishmael”, the opening line of Herman Melville’s novel Moby Dick.
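As a rough illustration of the codebook idea, the following Python sketch prints the 8-bit patterns for a few characters. It assumes the UTF-8 encoding, and `to_bits` is a hypothetical helper written for this example.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


print(to_bits("A"))  # 01000001
print(to_bits("3"))  # 00110011

# One 8-bit group per character, including the two spaces.
print(to_bits("Call me Ishmael"))
```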
An important aspect of how computers store information in binary form is that, unlike most human readable forms of data storage, there is no right to left, up or down, or any other regularized organization of bits on a storage medium. When you save a file on your computer, the computer simply looks for any open bits and starts recording information. The net result is that the contents of a single file are frequently randomly interleaved with data from other files. This mode of storage is used because it maximizes the use of open bits on the storage device. But it presents the singular problem of not making data readable in a regularized, linear fashion. To solve this problem, all computers reserve a particular part of their internal memory for a “Directory” which stores a sector map of all chunks of data. For example, if you create a file called README.txt with the word “hello” in it, the computer would randomly store the Unicode for the five characters in the word “hello” on the storage device and make a directory entry something like the following:
Understanding the Directory concept and how computers store information is crucial to understanding how VCS manage your Files.
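The Directory idea can be sketched in a few lines of Python. This is a toy model only: it stores one character per slot in a small list, whereas real file systems allocate much larger blocks. The `store` and `read` functions, the 32-slot `disk`, and the second file name are assumptions made for the illustration.

```python
import random


def store(disk, directory, filename, text, rng):
    """Scatter the characters of `text` over free slots and record where they went."""
    free = [i for i, slot in enumerate(disk) if slot is None]
    slots = rng.sample(free, len(text))  # free slots chosen in no particular order
    for slot, character in zip(slots, text):
        disk[slot] = character
    directory[filename] = slots          # the "sector map" for this file


def read(disk, directory, filename):
    """Reassemble a file by visiting its slots in the recorded order."""
    return "".join(disk[i] for i in directory[filename])


rng = random.Random(0)   # fixed seed so the layout is repeatable
disk = [None] * 32       # 32 empty storage slots
directory = {}

store(disk, directory, "README.txt", "hello", rng)
store(disk, directory, "Notes.txt", "world", rng)

print(directory["README.txt"])               # slot numbers, not in sequence
print(read(disk, directory, "README.txt"))   # hello
```

Without the directory entry, the characters on the disk are just scattered values; the entry is what turns them back into a file.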
How VCS Manage Your Files
Most users think about version control as a process of managing files. For example, I might have a directory called “My Project” that holds several files related to this project as follows:
One approach to managing changes to the above project files would be to store multiple versions of each file as in the figure below for the file analysis.r:
In fact, many VCS do exactly this. They treat each file as the minimum unit of data and simply save various versions of each file along with some additional information about the version. This approach can work reasonably well. However, it has limitations. First, this approach can unnecessarily consume space on the local storage device, especially if you are saving many versions of a very large file. It also has difficulty dealing with changes in filenames, typically treating the same file with a new name as a completely new file, thereby breaking the chain of version history.
To combat these issues, good VCS don’t actually manage files at all. They manage Directories. Distributed VCS like Git take this alternate approach to data storage that is Directory, rather than file, based.
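One way to see why tracking content rather than file names helps with renames is the Python sketch below. It labels each file's contents with a hash and then compares two directory listings. This is an illustration of the general idea under simplifying assumptions, not Git's exact storage format; `content_id` and the file names are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identify a file by what it contains, not by what it is called."""
    return hashlib.sha256(data).hexdigest()


# The same directory before and after a file is renamed.
before = {"analysis.r": b"summary(x)\n"}
after = {"analysis_final.r": b"summary(x)\n"}

ids_before = {content_id(data): name for name, data in before.items()}
ids_after = {content_id(data): name for name, data in after.items()}

# Content present in both listings under a different name was renamed.
renames = {
    ids_before[i]: ids_after[i]
    for i in ids_before.keys() & ids_after.keys()
    if ids_before[i] != ids_after[i]
}
print(renames)  # {'analysis.r': 'analysis_final.r'}
```

Because the contents match, the two names can be linked and the version history does not have to start over.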
Graph-Based Data Management
Git (and many other Distributed VCS) manage your files as collections of data rather than collections of files. Git’s primary unit of management is the “Repository,” or “Repo” for short, which is aligned with your computer’s Directory/Folder structure. Consider, for example, the following file structure:
Here we see a user, Tom’s, home directory, which contains three sub directories (Data, Thesis, and Tools) and one file (Notes.txt). Both the Data and Tools directories contain sub files and/or directories. If Tom wanted to track changes to the two files in the Data directory, he would first create a Git repository by placing the Data directory “under version control.”
When a repository is created, the Git system writes a collection of hidden files (kept in a hidden folder named .git) into the Data Directory that it uses to store information about all of the data that lives under that directory. This includes information about the addition, renaming, and deletion of both files and folders as well as information about changes to the data contained in the files themselves. Additions, deletions and versions of files are tracked and stored not as copies of files, but rather as a set of instructions that describes changes made to the underlying data and the directory structure that describes them.
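To get a feel for what “a set of instructions that describes changes” can look like, the Python sketch below compares two versions of a short file using the standard `difflib` module. It uses Python rather than Git, and Git's internal storage differs in detail; the file contents and version labels are made up for the example.

```python
import difflib

# Two versions of the same small script, as lists of lines.
old = ["x <- read.csv('data.csv')\n", "summary(x)\n"]
new = ["x <- read.csv('data.csv')\n", "x <- na.omit(x)\n", "summary(x)\n"]

# Describe how to get from the old version to the new one.
diff = list(
    difflib.unified_diff(
        old, new, fromfile="analysis.r (v1)", tofile="analysis.r (v2)"
    )
)
print("".join(diff))
```

Lines that begin with a plus sign were added and lines that begin with a minus sign were removed; unchanged lines appear only as context.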
Installing Git
In order to run Git version control and be ready for the Interactive Session of this workshop, you need to install it on your local machine. This is required preparation and we will not have time during the Interactive Session to help you troubleshoot installation issues. If you don’t have Git installed, you won’t be able to follow along with the activities. Git installation is typically an easy, point and click process, but there are some configuration steps along the way to which you’ll need to pay attention and thus we recommend that you try this well in advance of the workshop so you have time to troubleshoot your install if necessary.
Git on Windows
Follow these step-by-step instructions if you’re installing Git on a Windows machine:
First, launch a web browser. The image below shows the Microsoft Edge browser:
Next, navigate to the following Git download URL in your browser: https://git-scm.com/downloads
Select “Windows” from the Downloads portion of the Git webpage. Git will display the following page and automatically begin downloading the correct version of the Git software. If the download doesn’t start automatically, click on the “Click here to download manually” link:
When the download is complete, open/Run the downloaded file (this will look different in different browsers, but everyone should know how to do this):
A screen will appear asking for permissions for the Git application to make changes to your device. Click on the Yes button:
Click Next to accept the user license:
Leave the default “Destination Location” unchanged (usually C:Files) and hit Next.
You will see a screen like the one below asking you to “Select Components”:
Leave all of the default components selected and also check the boxes next to “Additional Icons” and its sub-item, “On the Desktop”. Your completed configurations window should have the following components selected:
Additional Icons
-> On the Desktop
Windows Explorer integration
-> Git Bash Here
-> Git GUI Here
Git LFS (Large File Support)
Associate .git* configuration files with default text editor
Associate .sh files to be run with Bash
And should look like this:
After verifying that you have the necessary components selected as per above, hit Next.
The next screen will ask you to “Select a Start Menu Folder.” Keep the default value of Git and hit Next:
Leave the default “Use Vim (the ubiquitous text editor) as Git’s default editor” selected on the “Choosing the default editor used by Git” screen and hit Next:
On the next screen, leave the default “let Git decide” option selected and hit Next:
Leave the default “Git from the command line and also from 3rd-party software” selected and hit Next:
On the next “Choosing HTTPS transport backend” page leave the default “Use the OpenSSL library” option selected and hit Next:
Leave the default “Checkout Windows-style, commit Unix-style line endings” selected on the next page and hit Next:
Keep the default “Use MinTTY (the default terminal of MSYS2)” selected on the “Configuring the terminal emulator to use with Git Bash” window and hit Next:
Keep the default value of “Default (fast-forward or merge)” on the “Choose the default behavior of ‘git pull’” page and hit Next:
Keep the default value of “Git Credential Manager Core” on the “Choose a credential helper” page and hit Next:
Keep the default values on the “Configuring extra options” page by keeping “Enable file system caching” checked and “Enable symbolic links” unchecked and then hit Next:
Make sure that no options are checked in the “Configuring experimental options” page and hit Install:
After you hit this Install button as per above, you will see an install progress screen like the one below:
When the install is complete, a new, “Completing the Git Setup Wizard” window like the one below will appear:
Make sure that all of the options on this window are unchecked as in the image below and then hit the Finish button:
This will complete your installation process.
Windows users should verify that when installing Git for Windows they have also installed Git Bash, which is necessary for working with Git on the command line.
Git on Mac
If you are installing Git on a Mac, there is no extra configuration. Simply go to the Git download page at https://git-scm.com/downloads, choose the latest version for Mac, and run the installer package when it has finished downloading. If you get an “unknown developer” warning during the install process, follow the instructions at the beginning of the video at https://www.youtube.com/watch?v=__kr-Ew5kbE to help you work through this problem.
Verifying Your Install
Whether you’re installing on Windows or Mac, note that unlike most applications that you’ve installed before, you will not find a “Git” application in your programs or applications directory once the installation is complete. As long as you don’t get an explicit error message during the installation process, you can assume that the software was successfully installed. Git is a command-line application, and we’ll cover how to use the command line during the Interactive Session. If you’re already familiar with using the command line, you can verify your install by opening the terminal (for Windows that will be Git Bash) and typing git --version. You should then see a response of your installed version (e.g., git version 2.12.2.windows.2, or git version 2.12.2.mac.2), and not the error “command not found.” If you aren’t familiar with the command line, we’ll cover this during the Interactive Session.
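If you happen to have Python installed, the same check can be scripted. The sketch below is optional and is not part of the workshop materials; `git_version` is a hypothetical helper that simply runs git --version for you and reports whether the command was found.

```python
import shutil
import subprocess


def git_version():
    """Return the text printed by `git --version`, or None if Git is unavailable."""
    if shutil.which("git") is None:  # git is not on the PATH
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:       # git was found but did not run cleanly
        return None
    return result.stdout.strip()


print(git_version() or "git: command not found")
```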
Installation Troubleshooting
If you are not able to successfully install Git on your own, please attend DataLab’s Virtual Office Hours, which are held every Monday afternoon from 1:30 to 3:00 pm, to get help with your installation. Click here for more information and to receive a Zoom link.
Ready, Set, Go…
If you’ve read and understood the information in this online tutorial and successfully installed Git on your local machine, you’re ready for the Interactive Session!
Additional Resources
The Git Book is the definitive Git resource and provides an excellent reference for everything that we will cover in the Interactive Session. There is no need to read the book prior to the session, but it’s a good reference resource to have available as you begin to work with Git after the workshop.
Click here to see full version with exhibits used in live course session.